ABSTRACT


The use of techniques for feature selection allows one to treat
classification problems in spaces of lower dimension. In this note we consider
a method of linear feature selection for n-dimensional observation vectors
which belong to one of m populations, where each population has a known
a priori probability and is described by a known multivariate normal density
function. Specifically, we consider the problem of finding a k x n matrix
B of rank k (k < n) for which the transformed probability of
misclassification is minimized.

Subject to the condition that the transformed a posteriori probabilities
are distinct we obtain theoretical results which, for the case k = 1, give 
rise to a numerically tractable formula for the derivative of the probability 
of misclassification. It is shown that for the two population problem this 
condition is also necessary. Finally, we investigate the dependence of the 
minimum probability of error on the a priori probabilities and show that the 
minimum probability of error satisfies a uniform Lipschitz condition with 
respect to the a priori probabilities. 
On Differentiating the Probability of Error
in the Multipopulation Feature Selection Problem 


1. Introduction 

Let $\pi_1, \ldots, \pi_m$ be populations in $R^n$ with a priori probabilities
$\alpha_1, \ldots, \alpha_m$ and conditional densities $p_i(x)$, $i = 1, \ldots, m$, defined for
$x = (x_1, \ldots, x_n)^T \in R^n$ by

\[
p_i(x) = (2\pi)^{-n/2} |\Sigma_i|^{-1/2}
         \exp\left\{ -\tfrac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}.
\]


If B is a k x n matrix of rank k, then the transformed conditional
densities are defined for $y = (y_1, \ldots, y_k)^T \in R^k$ by

\[
p_i(y,B) = (2\pi)^{-k/2} |B\Sigma_i B^T|^{-1/2}
           \exp\left\{ -\tfrac{1}{2} (y - B\mu_i)^T (B\Sigma_i B^T)^{-1} (y - B\mu_i) \right\}.
\]
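
For k = 1 the matrix B is a single row vector, and the transformed density is a univariate normal density with mean $B\mu_i$ and variance $B\Sigma_i B^T$. The following is a minimal sketch of that special case in plain Python. The function names (`dot`, `quad_form`, `projected_density`) and the sample parameters are illustrative assumptions and do not come from the note.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def quad_form(u, S, v):
    """Bilinear form u S v^T for a matrix S given as a list of rows."""
    return sum(u[i] * S[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def projected_density(y, b, mu, sigma):
    """Transformed density p_i(y, B) for k = 1, with B the row vector b.

    The projected population is normal with mean b.mu and
    variance b Sigma b^T.
    """
    mean = dot(b, mu)
    var = quad_form(b, sigma, b)
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# Hypothetical example with n = 2, projecting onto b = (1, 1).
mu = [1.0, 2.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, 1.0]
print(projected_density(3.0, b, mu, sigma))  # density at the projected mean
```

In this example the projected mean is 3 and the projected variance is 4, so the printed value is the peak of a normal density with standard deviation 2.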


Let g(B) denote the probability of misclassification in $R^k$ as a function
of B, with a Bayes optimal (maximum likelihood) classification rule.

If $B_0$ minimizes g(B) and the Gateaux differential [3, p. 171],

\[
\delta g(B_0;C) = \lim_{s \to 0} \frac{g(B_0 + sC) - g(B_0)}{s},
\]

exists for a k x n matrix C, then $\delta g(B_0;C) = 0$. Thus it is desirable
to obtain a formula for $\delta g(B;C)$. Such a formula has been obtained for the
case m = 2, $\alpha_1 = \alpha_2 = 1/2$, by Guseman and Walker [1], [2]. In this
note we obtain a formula for the general case subject to the condition
that the functions $\alpha_i p_i(y,B)$ are all distinct. Unless otherwise stated,
this assumption will be made.

2. Differentiating the Probability of Error. 

Using a maximum likelihood classification rule, the probability of
error in $R^k$ as a function of a feature selection matrix B of rank k
may be expressed as

\[
g(B) = \int_{R_1(B)} f_1(y,B)\,dy + \cdots + \int_{R_m(B)} f_m(y,B)\,dy,
\]

where

\[
f_i(y,B) = \sum_{\ell \ne i} \alpha_\ell\, p_\ell(y,B)
\]

and

\[
\begin{aligned}
R_i(B) &= \{ y \in R^k \mid \alpha_i p_i(y,B) > \alpha_j p_j(y,B) \text{ for all } j \ne i \} \\
       &= \{ y \in R^k \mid f_i(y,B) < f_j(y,B) \text{ for } j \ne i \}.
\end{aligned}
\]

Since the functions $\alpha_i p_i(y,B)$ are distinct, the $R_i(B)$ are disjoint
open sets which cover $R^k$ except for a set of measure zero; i.e.,
their boundaries.
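
The expression for g(B) can be approximated numerically when k = 1. The sketch below assumes the projected means $B\mu_i$ and variances $B\Sigma_i B^T$ have already been computed, assigns each grid point to the region with the largest $\alpha_i p_i$, and sums the corresponding $f_i$ with a midpoint rule. The function names, the integration window and the step count are assumptions made for illustration only.

```python
import math

def normal_pdf(y, mean, var):
    """Univariate normal density."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def error_probability(alphas, means, variances, steps=20000, width=10.0):
    """Approximate g(B) for k = 1 by a midpoint rule.

    means[i] and variances[i] are the projected parameters B mu_i and
    B Sigma_i B^T; alphas[i] are the a priori probabilities.
    """
    sd_max = math.sqrt(max(variances))
    lo = min(means) - width * sd_max
    hi = max(means) + width * sd_max
    h = (hi - lo) / steps
    total = 0.0
    for n in range(steps):
        y = lo + (n + 0.5) * h
        weighted = [a * normal_pdf(y, m, v)
                    for a, m, v in zip(alphas, means, variances)]
        # y lies in R_i(B) for the index i with the largest alpha_i p_i(y, B).
        i = max(range(len(weighted)), key=weighted.__getitem__)
        # f_i(y, B) is the sum of the remaining weighted densities.
        total += (sum(weighted) - weighted[i]) * h
    return total

# Hypothetical two-population example: projected means 0 and 2, unit variances.
g_value = error_probability([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
print(g_value)
```

For two equally likely populations with a common projected variance, the result can be compared with the standard normal distribution function evaluated at minus half the standardized distance between the projected means.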
Let

\[
r(y,B) = \min_i f_i(y,B).
\]

Then $R_i(B)$ is the interior of the set

\[
\{ y \in R^k \mid f_i(y,B) = r(y,B) \}
\]

and

\[
g(B) = \int_{R^k} r(y,B)\,dy.
\]

Let C be a k x n matrix. If $y \in R_i(B)$ and |s| is sufficiently
small, then

\[
r(y,B + sC) = f_i(y,B + sC).
\]

Hence, for $y \in R_i(B)$,

\[
\begin{aligned}
\lim_{s \to 0} \frac{r(y,B + sC) - r(y,B)}{s}
  &= \lim_{s \to 0} \frac{f_i(y,B + sC) - f_i(y,B)}{s} \\
  &= \delta f_i(y,B;C) \\
  &= \sum_{\ell \ne i} \alpha_\ell\, \delta p_\ell(y,B;C).
\end{aligned}
\]
Thus, provided that

\[
\lim_{s \to 0} \int_{R_i(B)} \frac{r(y,B + sC) - r(y,B)}{s}\,dy
  = \int_{R_i(B)} \lim_{s \to 0} \frac{r(y,B + sC) - r(y,B)}{s}\,dy, \tag{1}
\]

we have

\[
\delta g(B;C) = \sum_{i=1}^{m} \sum_{\substack{\ell = 1 \\ \ell \ne i}}^{m}
                \alpha_\ell \int_{R_i(B)} \delta p_\ell(y,B;C)\,dy. \tag{2}
\]


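It is shown in [2] that

\[
\delta p_\ell(y,B;C) = p_\ell(y,B) \Big\{ (y - B\mu_\ell)^T (B\Sigma_\ell B^T)^{-1}
    \big[ C\mu_\ell + C\Sigma_\ell B^T (B\Sigma_\ell B^T)^{-1} (y - B\mu_\ell) \big]
    - \operatorname{tr}\big[ C\Sigma_\ell B^T (B\Sigma_\ell B^T)^{-1} \big] \Big\}. \tag{3}
\]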

Combining (2) and (3) gives the required formula for $\delta g(B;C)$. For
k > 1 this formula is numerically intractable because of the integrals
which appear. For k = 1, however, it is possible to obtain an
integral-free expression for $\delta g(B;C)$. Indeed, when k = 1, (3) becomes

\[
\delta p_\ell(y,B;C) = p_\ell(y,B) \left\{
    \frac{C\Sigma_\ell B^T}{(B\Sigma_\ell B^T)^2} (y - B\mu_\ell)^2
    + \frac{C\mu_\ell}{B\Sigma_\ell B^T} (y - B\mu_\ell)
    - \frac{C\Sigma_\ell B^T}{B\Sigma_\ell B^T} \right\}. \tag{4}
\]
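
The k = 1 differential can be checked against a difference quotient of the transformed density itself. The sketch below is such a check in plain Python; the function names and the sample values of $\mu$, $\Sigma$, B and C are hypothetical, and $\Sigma$ is assumed symmetric so that $C\Sigma B^T = B\Sigma C^T$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quad_form(u, S, v):
    return sum(u[i] * S[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def density(y, b, mu, sigma):
    """Transformed density p(y, B) for k = 1, with B the row vector b."""
    mean, var = dot(b, mu), quad_form(b, sigma, b)
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def delta_density(y, b, c, mu, sigma):
    """Gateaux differential of p(y, B) in the direction C for k = 1."""
    mean = dot(b, mu)
    var = quad_form(b, sigma, b)      # B Sigma B^T
    cross = quad_form(c, sigma, b)    # C Sigma B^T
    shift = dot(c, mu)                # C mu
    d = y - mean
    bracket = cross * d * d / var ** 2 + shift * d / var - cross / var
    return density(y, b, mu, sigma) * bracket

def difference_quotient(y, b, c, mu, sigma, s=1e-6):
    """Central difference approximation to the same differential."""
    plus = [bi + s * ci for bi, ci in zip(b, c)]
    minus = [bi - s * ci for bi, ci in zip(b, c)]
    return (density(y, plus, mu, sigma) - density(y, minus, mu, sigma)) / (2.0 * s)

# Hypothetical parameters with n = 2.
mu = [1.0, -0.5]
sigma = [[2.0, 0.3], [0.3, 1.0]]
b = [1.0, 0.5]
c = [0.2, -1.0]
for y in (-1.0, 0.5, 2.0):
    print(y, delta_density(y, b, c, mu, sigma),
          difference_quotient(y, b, c, mu, sigma))
```

The two printed columns should agree to within the accuracy of the central difference.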
Integrating (4) by parts yields

\[
\int_{R_i(B)} \delta p_\ell(y,B;C)\,dy
  = - p_\ell(y,B) \left[ \frac{C\Sigma_\ell B^T}{B\Sigma_\ell B^T} (y - B\mu_\ell) + C\mu_\ell \right] \Bigg|_{R_i(B)},
\]

where $\big|_{R_i(B)}$ means the sum of the values of the function at the right
endpoints of the intervals comprising $R_i(B)$ minus the sum of its values
at the left endpoints. Thus, for k = 1,


\[
-\delta g(B;C) = \sum_{i=1}^{m} \sum_{\substack{j = 1 \\ j \ne i}}^{m}
    \alpha_j\, p_j(y,B) \left[ \frac{C\Sigma_j B^T}{B\Sigma_j B^T} (y - B\mu_j) + C\mu_j \right] \Bigg|_{R_i(B)}. \tag{5}
\]
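
Formula (5) can be evaluated once the endpoints of the intervals comprising each $R_i(B)$ are known. The sketch below locates those endpoints by a grid scan followed by bisection, which is an assumption of this illustration rather than a method given in the note, and then sums the endpoint terms. It assumes at most one boundary per grid cell and that the scan window is wide enough for the terms at its ends to be negligible. All names and parameter values are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quad_form(u, S, v):
    return sum(u[i] * S[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def weighted_densities(y, b, alphas, mus, sigmas):
    """Values alpha_j p_j(y, B) for k = 1, with B the row vector b."""
    out = []
    for a, mu, sigma in zip(alphas, mus, sigmas):
        mean, var = dot(b, mu), quad_form(b, sigma, b)
        out.append(a * math.exp(-0.5 * (y - mean) ** 2 / var)
                   / math.sqrt(2.0 * math.pi * var))
    return out

def region(y, b, alphas, mus, sigmas):
    """Index i of the region R_i(B) containing y."""
    w = weighted_densities(y, b, alphas, mus, sigmas)
    return max(range(len(w)), key=w.__getitem__)

def boundaries(b, alphas, mus, sigmas, lo, hi, steps=2001):
    """Region boundaries as (point, left_region, right_region) triples."""
    found = []
    h = (hi - lo) / steps
    for n in range(steps):
        left, right = lo + n * h, lo + (n + 1) * h
        i = region(left, b, alphas, mus, sigmas)
        j = region(right, b, alphas, mus, sigmas)
        if i == j:
            continue
        for _ in range(60):  # bisection on the region label
            mid = 0.5 * (left + right)
            if region(mid, b, alphas, mus, sigmas) == i:
                left = mid
            else:
                right = mid
        found.append((0.5 * (left + right), i, j))
    return found

def delta_g(b, c, alphas, mus, sigmas, lo, hi):
    """Evaluate delta g(B; C) from the endpoint sums in (5)."""
    def term(j, y):
        mean, var = dot(b, mus[j]), quad_form(b, sigmas[j], b)
        cross, shift = quad_form(c, sigmas[j], b), dot(c, mus[j])
        w = weighted_densities(y, b, alphas, mus, sigmas)[j]
        return w * (cross / var * (y - mean) + shift)
    m = len(alphas)
    total = 0.0
    for y, i_left, i_right in boundaries(b, alphas, mus, sigmas, lo, hi):
        # y is a right endpoint of R_{i_left} and a left endpoint of R_{i_right}.
        total += sum(term(j, y) for j in range(m) if j != i_left)
        total -= sum(term(j, y) for j in range(m) if j != i_right)
    return -total  # (5) is stated for -delta g

# Hypothetical two-population example with n = 2.
alphas = [0.5, 0.5]
mus = [[0.0, 0.0], [2.0, 1.0]]
sigmas = [[[1.0, 0.3], [0.3, 2.0]], [[1.0, 0.3], [0.3, 2.0]]]
b = [1.0, 0.0]
c = [0.0, 1.0]
result = delta_g(b, c, alphas, mus, sigmas, -10.0, 12.0)
print(result)
```

In this example the two projected densities have equal variance, so there is a single boundary midway between the projected means and the differential can be verified by differentiating the closed-form two-class error directly.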


The remainder of this section is devoted to showing that (1) is 
true. To do this we require three lemmas. The first two of these are 
generalizations of well known facts from calculus and integration theory. 

If f is a real valued function defined in a neighborhood of a real number
x, let $\overline{f}'(x)$ and $\underline{f}'(x)$ denote respectively its upper and lower derivates
at x, defined by [4, p. 96]

\[
\overline{f}'(x) = \limsup_{y \to x} \frac{f(y) - f(x)}{y - x}, \qquad
\underline{f}'(x) = \liminf_{y \to x} \frac{f(y) - f(x)}{y - x}.
\]

Lemma 1: If f is continuous on an interval [a,b], then there exists
$c \in (a,b)$ such that

\[
\underline{f}'(c) \le \frac{f(b) - f(a)}{b - a} \le \overline{f}'(c).
\]

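Lemma 2: Let $(X,\mu)$ be a measure space. Suppose h(y,s) is a real
valued function on $X \times [-\delta,\delta]$ such that for each s, h(y,s) is
absolutely integrable on X and for each y, h(y,s) is continuous in s.
Suppose also that there exists an absolutely integrable function $\beta(y)$ such
that

\[
|\overline{h}_s(y,s)| \le \beta(y), \qquad |\underline{h}_s(y,s)| \le \beta(y)
\]

for all y and s, where $\overline{h}_s$ and $\underline{h}_s$ denote the upper and lower derivates
of h with respect to s, and that for each y the partial derivative $h_s(y,0)$
exists. Then

\[
\frac{d}{ds} \int_X h(y,s)\,dy \,\Bigg|_{s=0} = \int_X h_s(y,0)\,dy.
\]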

Proof: Apply Lemma 1 and the Lebesgue dominated convergence theorem
[4, p. 229].

Lemma 3: If $\delta > 0$ is small enough that B + sC is rank k for $|s| \le \delta$,
then there exists a function $\beta(y)$, integrable on $R^k$, such that

\[
|\delta f_j(y,B + sC;C)| \le \beta(y)
\]

for all $y \in R^k$, $|s| \le \delta$, j = 1, ..., m.

Proof: By (3),

\[
\begin{aligned}
\delta f_j(y,B + sC;C) &= \sum_{\ell \ne j} \alpha_\ell\, \delta p_\ell(y,B + sC;C) \\
 &= \sum_{\ell \ne j} \alpha_\ell\, p_\ell(y,B + sC) \Big\{ [y - (B + sC)\mu_\ell]^T [(B + sC)\Sigma_\ell (B + sC)^T]^{-1} \\
 &\qquad \big[ C\mu_\ell + C\Sigma_\ell (B + sC)^T ((B + sC)\Sigma_\ell (B + sC)^T)^{-1} (y - (B + sC)\mu_\ell) \big] \\
 &\qquad - \operatorname{tr}\big[ C\Sigma_\ell (B + sC)^T ((B + sC)\Sigma_\ell (B + sC)^T)^{-1} \big] \Big\}.
\end{aligned}
\]

Since the means and covariances of the density functions $p_\ell(y,B + sC)$,
as well as the coefficients of the terms in { }, are continuous functions
of s, they range over compact sets as s ranges over $[-\delta,\delta]$. From this
fact, it is clear that the required function $\beta(y)$ exists. Since the actual
construction of $\beta(y)$ is tedious it will be omitted.

Now let h(y,s) = r(y,B + sC). We want to show that

\[
\frac{d}{ds} \int_{R_i(B)} h(y,s)\,dy \,\Bigg|_{s=0} = \int_{R_i(B)} h_s(y,0)\,dy.
\]


Let $\delta > 0$ be small enough that for $|s| \le \delta$, B + sC is rank k and
the functions $\alpha_j p_j(y,B + sC)$ are all distinct. Let $\beta(y)$ be the function
in Lemma 3. Clearly, h(y,s) is integrable on $R_i(B)$ for each fixed s
and continuous on $[-\delta,\delta]$ for each fixed y. Thus the result follows
from Lemma 2 once it is shown that the upper and lower derivates of h
with respect to s satisfy

\[
|\overline{h}_s(y,s)| \le \beta(y), \qquad |\underline{h}_s(y,s)| \le \beta(y) \tag{6}
\]

for all $y \in R_i(B)$, $|s| \le \delta$. For $y \in R_i(B)$ and $|s| \le \delta$, there are
two possibilities:

Case 1: y is in $R_j(B + sC)$ for some j. Then $h_s(y,s) = \delta f_j(y,B + sC;C)$ and
(6) follows from Lemma 3.

Case 2: y is not in any $R_j(B + sC)$. Then $h(y,s) = f_j(y,B + sC)$ for
more than one index j. Let J(y) be the set of indices j such that
$h(y,s) = f_j(y,B + sC)$. Then for sufficiently small |t| > 0,

\[
h(y,s + t) = r(y,B + sC + tC) = f_j(y,B + sC + tC)
\]

for some j, depending on t, in J(y). Thus,

\[
\frac{h(y,s + t) - h(y,s)}{t} = \frac{f_j(y,B + sC + tC) - f_j(y,B + sC)}{t}.
\]

Since J(y) is a finite set, there are indices j and k in J(y) such
that

\[
\overline{h}_s(y,s) = \delta f_j(y,B + sC;C), \qquad
\underline{h}_s(y,s) = \delta f_k(y,B + sC;C),
\]

and (6) follows again from Lemma 3.

This concludes the proof.

3. The Case of Non-Distinct Transformed Densities 

In this section we show that the requirement that the $\alpha_i p_i(y,B)$ be
distinct cannot be eliminated. Specifically, consider a two population problem
where $\alpha_1 = \alpha_2 = 1/2$ and $p_1(y,B) = p_2(y,B)$; that is, $B\mu_1 = B\mu_2$ and
$B\Sigma_1 B^T = B\Sigma_2 B^T$. Let C be a k x n matrix such that $C\mu_1 \ne C\mu_2$ or
$C\Sigma_1 B^T \ne C\Sigma_2 B^T$. We will show that $\delta g(B;C)$ does not exist. Indeed, using the
formula

\[
\min\{f_1, f_2\} = \tfrac{1}{2} \left[ f_1 + f_2 - |f_1 - f_2| \right]
\]

we see that

\[
\begin{aligned}
g(B + sC) &= \tfrac{1}{2} \int_{R^k} \min\{ p_1(y,B + sC),\, p_2(y,B + sC) \}\,dy \\
          &= \tfrac{1}{2} - \tfrac{1}{4} \int_{R^k} | p_1(y,B + sC) - p_2(y,B + sC) |\,dy
\end{aligned}
\]

and, in particular, g(B) = 1/2.


Hence, for s > 0,

\[
\frac{g(B + sC) - g(B)}{s}
  = -\tfrac{1}{4} \int_{R^k} \left| \frac{p_1(y,B + sC) - p_1(y,B)}{s}
      - \frac{p_2(y,B + sC) - p_2(y,B)}{s} \right| dy,
\]

which tends to

\[
-\tfrac{1}{4} \int_{R^k} | \delta p_1(y,B;C) - \delta p_2(y,B;C) |\,dy
\]

as $s \to 0$. On the other hand, for s < 0,

\[
\frac{g(B + sC) - g(B)}{s}
  = \tfrac{1}{4} \int_{R^k} \left| \frac{p_1(y,B + sC) - p_1(y,B)}{s}
      - \frac{p_2(y,B + sC) - p_2(y,B)}{s} \right| dy,
\]

which tends to

\[
\tfrac{1}{4} \int_{R^k} | \delta p_1(y,B;C) - \delta p_2(y,B;C) |\,dy.
\]

Hence $\delta g(B;C)$ exists if and only if

\[
\int_{R^k} | \delta p_1(y,B;C) - \delta p_2(y,B;C) |\,dy = 0.
\]


That is, if and only if $\delta p_1(y,B;C) = \delta p_2(y,B;C)$ almost everywhere. But

\[
\delta p_1(y,B;C) = p_1(y,B) \Big\{ (y - B\mu_1)^T (B\Sigma_1 B^T)^{-1}
    \big[ C\mu_1 + C\Sigma_1 B^T (B\Sigma_1 B^T)^{-1} (y - B\mu_1) \big]
    - \operatorname{tr}\big[ C\Sigma_1 B^T (B\Sigma_1 B^T)^{-1} \big] \Big\}
\]

and

\[
\delta p_2(y,B;C) = p_1(y,B) \Big\{ (y - B\mu_1)^T (B\Sigma_1 B^T)^{-1}
    \big[ C\mu_2 + C\Sigma_2 B^T (B\Sigma_1 B^T)^{-1} (y - B\mu_1) \big]
    - \operatorname{tr}\big[ C\Sigma_2 B^T (B\Sigma_1 B^T)^{-1} \big] \Big\}.
\]

Since the polynomial parts of these two expressions have different coefficients,
they cannot be equal almost everywhere. Hence, $\delta g(B;C)$ does not exist.
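
The failure of differentiability can be seen numerically in a small instance of this construction. The sketch below assumes n = 2, k = 1, $\Sigma_1 = \Sigma_2 = I$, $\mu_1 = (0,0)$, $\mu_2 = (0,1)$, B = (1, 0) and C = (0, 1); these values are chosen for illustration and are not taken from the note. With them, $B\mu_1 = B\mu_2$ and $C\mu_1 \ne C\mu_2$, and the two one-sided difference quotients of g at B have opposite signs.

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def g(s, steps=20000, lo=-12.0, hi=12.0):
    """g(B + sC) = 1/2 - 1/4 * integral of |p_1 - p_2| (midpoint rule).

    With the assumed parameters, B + sC = (1, s), so the projected means
    are 0 and s and both projected variances equal 1 + s^2.
    """
    var = 1.0 + s * s
    h = (hi - lo) / steps
    total = 0.0
    for n in range(steps):
        y = lo + (n + 0.5) * h
        total += abs(normal_pdf(y, 0.0, var) - normal_pdf(y, s, var)) * h
    return 0.5 - 0.25 * total

def one_sided_quotients(s=1e-3):
    """Difference quotients of g at B from the left and from the right."""
    right = (g(s) - g(0.0)) / s
    left = (g(-s) - g(0.0)) / (-s)
    return left, right

left, right = one_sided_quotients()
print(left, right)
```

The left quotient is positive and the right quotient is negative, with nearly equal magnitudes, so no two-sided limit exists at s = 0.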

Notice that the problem of nondifferentiability does not arise if the
a priori probabilities are distinct, since the functions $\alpha_i p_i(y,B)$ are
distinct in this case: integrating an identity $\alpha_i p_i = \alpha_j p_j$ over $R^k$
would give $\alpha_i = \alpha_j$. This suggests that if some of the a priori probabilities
are equal then one might attempt to find a B which nearly minimizes g(B)
by changing the a priori probabilities slightly and ensuring that the new
a priori probabilities are distinct. The following theorem shows that this
approach is valid. Let $\alpha = (\alpha_1, \ldots, \alpha_m)$ denote the vector of a priori
probabilities and write $g(B,\alpha)$ to show the dependence of the probability of error
on $\alpha$ as well as on the feature selection matrix B. Let $f_i(y,B)$ be defined
as in Section 2, and let

\[
f(y,B) = \sum_{i=1}^{m} \alpha_i\, p_i(y,B).
\]

Then

\[
\begin{aligned}
g(B,\alpha) &= \int_{R^k} \min_i f_i(y,B)\,dy \\
  &= \int_{R^k} \min_i \left( f(y,B) - \alpha_i p_i(y,B) \right) dy \\
  &= \int_{R^k} \left[ f(y,B) - \max_i \alpha_i p_i(y,B) \right] dy \\
  &= 1 - \int_{R^k} \max_i \alpha_i p_i(y,B)\,dy.
\end{aligned}
\]


Theorem: For all $\alpha$ and $\beta$,

\[
\left| \min_B g(B,\alpha) - \min_B g(B,\beta) \right| \le \|\alpha - \beta\|,
\]

where $\|\alpha - \beta\| = |\alpha_1 - \beta_1| + \cdots + |\alpha_m - \beta_m|$.

Proof: In view of the formula for $g(B,\alpha)$ given above, it clearly suffices
to show that if $q_1(y), \ldots, q_m(y)$ are probability density functions on $R^k$
and $\alpha$, $\beta$ are m-tuples of real numbers, then

\[
\int_{R^k} \Big| \max_{1 \le i \le m} \alpha_i q_i(y) - \max_{1 \le i \le m} \beta_i q_i(y) \Big|\,dy \le \|\alpha - \beta\|. \tag{7}
\]

This inequality is clear for m = 1. For m > 1 write

\[
\max_{i \le m} \alpha_i q_i(y) = \tfrac{1}{2} \Big\{ \alpha_m q_m(y) + \max_{i \le m-1} \alpha_i q_i(y)
    + \big| \alpha_m q_m(y) - \max_{i \le m-1} \alpha_i q_i(y) \big| \Big\}.
\]

On substituting this and the corresponding expansion for $\max_{i \le m} \beta_i q_i(y)$ into
the left hand side of (7) it follows easily that

\[
\begin{aligned}
\int_{R^k} \Big| \max_{i \le m} \alpha_i q_i(y) - \max_{i \le m} \beta_i q_i(y) \Big|\,dy
  &\le |\alpha_m - \beta_m| \int_{R^k} q_m(y)\,dy
     + \int_{R^k} \Big| \max_{i \le m-1} \alpha_i q_i(y) - \max_{i \le m-1} \beta_i q_i(y) \Big|\,dy \\
  &= |\alpha_m - \beta_m|
     + \int_{R^k} \Big| \max_{i \le m-1} \alpha_i q_i(y) - \max_{i \le m-1} \beta_i q_i(y) \Big|\,dy.
\end{aligned}
\]

Thus the result follows by induction.
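
Inequality (7) and the resulting bound on the change in the probability of error can be checked numerically in one dimension. The sketch below uses three hypothetical univariate normal densities and two hypothetical prior vectors; none of these values come from the note, and the integrals are approximated by a midpoint rule over a window assumed wide enough to hold essentially all of the mass.

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def integrate(func, lo, hi, steps=20000):
    """Midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    return sum(func(lo + (n + 0.5) * h) for n in range(steps)) * h

# Hypothetical densities q_1, q_2, q_3 given as (mean, variance) pairs.
params = [(0.0, 1.0), (1.0, 0.25), (3.0, 4.0)]
alpha = [0.5, 0.3, 0.2]
beta = [0.4, 0.4, 0.2]

def max_weighted(weights, y):
    """max_i weights[i] * q_i(y)."""
    return max(w * normal_pdf(y, m, v) for w, (m, v) in zip(weights, params))

# Left and right hand sides of (7).
lhs = integrate(lambda y: abs(max_weighted(alpha, y) - max_weighted(beta, y)),
                -20.0, 25.0)
rhs = sum(abs(a - b) for a, b in zip(alpha, beta))

# Probability of error written as 1 minus the integral of the maximum.
g_alpha = 1.0 - integrate(lambda y: max_weighted(alpha, y), -20.0, 25.0)
g_beta = 1.0 - integrate(lambda y: max_weighted(beta, y), -20.0, 25.0)

print(lhs, rhs, abs(g_alpha - g_beta))
```

The change in the probability of error is bounded by the left hand side of (7), which in turn is bounded by the sum of the absolute differences of the priors.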


4. Concluding Remarks 

It will be shown in a subsequent report that the condition that the
$\alpha_i p_i(y,B_0)$ be distinct is necessary as well as sufficient for the differentiability
of g(B) at $B_0$. Thus the following conjecture is of importance whether it
is intended to solve the variational equation directly for the minimizing
B or to use a steepest descent method and use the expression for $\delta g(B;C)$
developed in Section 2 to compute the gradient at each step.

Conjecture: If $\alpha_i p_i(x) \ne \alpha_j p_j(x)$ for $i \ne j$, i, j = 1, ..., m, and $B_0$ minimizes
g(B), then the functions $\alpha_i p_i(y,B_0)$ are distinct.